Mechanism for handling explicit writeback in a cache coherent multi-node architecture

ABSTRACT

A method and apparatus for a mechanism for handling explicit writeback in a cache coherent multi-node architecture is described. In one embodiment, the invention is a method. The method includes receiving a read request relating to a first line of data in a coherent memory system. The method further includes receiving a write request relating to the first line of data at about the same time as the read request is received. The method further includes detecting that the read request and the write request both relate to the first line. The method also includes determining which request of the read and write request should proceed first. Additionally, the method includes completing the request of the read and write request which should proceed first.

RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 10/896,151 filed on Jul. 20, 2004, which is a continuation of U.S. application Ser. No. 09/823,791, filed on Mar. 31, 2001, entitled “Mechanism for Handling Explicit Writeback in a Cache Coherent Multi-Node Architecture.”

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to communications between integrated circuits and more specifically to data transfer and coherency in a multi-node or multi-processor system.

2. Description of the Related Art

Processors and caches have existed since shortly after the advent of the computer. However, the move to using multiple processors has posed new challenges. Previously, data existed in one place (memory for example) and might be copied into one other place (a cache for example). Keeping data coherent between the two possible locations for the data was a relatively simple problem. Utilizing multiple processors, multiple caches may exist, and each may have a copy of a piece of data. Alternatively, a single processor may have a copy of a piece of data which it needs to use exclusively.

If two copies of the data exist, or one copy exists aside from the original, some potential for a conflict in data exists in a multi-processor system. For example, a first processor with exclusive use of a piece of data may modify that data, and subsequently a second processor may request a copy of the piece of data from memory. If the first processor is about to write the piece of data back to memory when the second processor requests the piece of data, stale data may be read from memory, or corrupted data may be read from the write. The stale data results when the write should have completed before the read completed (but did not); had it done so, the read instruction would have retrieved the updated data. The corrupted data may result when the read should have completed before the write completed (but did not); had it done so, the read instruction would have retrieved the data prior to the update.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the accompanying figures.

FIG. 1 illustrates a block diagram of an embodiment of a system having multiple processors.

FIG. 2 illustrates a block diagram of an alternate embodiment of a system having multiple processors.

FIG. 3 illustrates a block diagram of an embodiment of an I/O (input/output) subsystem.

FIG. 4 illustrates a block diagram of an embodiment of a scalability port.

FIG. 5 illustrates a flow diagram of an embodiment of a read-write conflict.

FIG. 6A illustrates a flow diagram of an embodiment of a process of handling a read-write conflict.

FIG. 6B illustrates a flow diagram of an embodiment of a process of handling a read-write conflict.

FIG. 7 illustrates a flow diagram of an embodiment of a process including a read-write conflict.

FIG. 8A illustrates a flow diagram of an embodiment of a process suitable for resolving a read-write conflict.

FIG. 8B illustrates a flow diagram of an alternate embodiment of a process suitable for resolving a read-write conflict.

FIG. 9 illustrates a block diagram of an embodiment of a processor having portions of a scalability port integrated therein.

FIG. 10 illustrates a block diagram of an alternate embodiment of a processor having portions of a scalability port integrated therein.

DETAILED DESCRIPTION

A method and apparatus for a mechanism for handling explicit writeback in a cache coherent multi-node architecture is described. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.

Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

A coherent data architecture should reduce conflicts between nodes within the architecture which need to read and write data at about the same time. For example, processor (or node) A may be reading a first data line for purposes of a calculation at the same time that processor B may be writing the first data line. In some instances, these conflicts will resolve themselves, but attempting to let the conflicts resolve themselves randomly might lead to a non-deterministic system. Therefore, it is preferable to resolve read-write conflicts such as these in a manner which is predictable.

Read-write conflicts may be resolved by sending reads and writes through some sort of controller or port, such as a scalability port. Within the port, addresses of reads and writes may be compared, such that conflicts may be detected. When a conflict is detected, a decision may be made as to whether to stall the read or the write. Such a decision may be made based on a variety of factors, depending on the design of the system, and may consider such things as when the requests were received by the port, the priority of the requests, the nature of the requests, and other considerations. Once a decision is made, one of the conflicting operations will complete, and then the other will complete. Since the decision making will be hardwired, any given situation will have a predictable result, and users of the system (such as system designers and programmers) may adapt their use to the predictable result.
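The address comparison and fixed ordering decision described above can be sketched in software as follows. This is an illustrative model only, not the hardwired logic of any embodiment: the Request record, the 64-byte LINE_BYTES value, and the arrival-order rule are assumptions chosen for the example, and arrival order is just one of the factors the description says a design may consider.

```python
from dataclasses import dataclass
from typing import Tuple

LINE_BYTES = 64  # assumed coherence-unit size; the description does not fix one


@dataclass(frozen=True)
class Request:
    kind: str     # "read" or "write"
    node: str     # originating node (hypothetical label)
    address: int  # byte address of the access
    arrival: int  # order in which the port received the request


def line_of(address: int) -> int:
    """Map a byte address to the index of the line that contains it."""
    return address // LINE_BYTES


def detect_conflict(a: Request, b: Request) -> bool:
    """A read and a write conflict when both address the same line."""
    return {a.kind, b.kind} == {"read", "write"} and line_of(a.address) == line_of(b.address)


def order(a: Request, b: Request) -> Tuple[Request, Request]:
    """Fixed rule assumed here: the request received first proceeds first."""
    return (a, b) if a.arrival <= b.arrival else (b, a)
```

Because order() is a pure function of its inputs, the same pair of requests always resolves the same way, which is the predictability property the description asks for.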

FIG. 1 illustrates a block diagram of an embodiment of a system having multiple processors. A first processor 110 and a second processor 120 are coupled to a node controller 130, and the first processor 110 may be coupled directly to the second processor 120 as well. Memory 150 is also coupled to node controller 130. Furthermore, scalability ports 135 are used to couple node controller 130 to I/O (input/output) hub 140, which in turn may be coupled to various I/O devices (not shown). In such an embodiment, the scalability ports 135 may be used to control accesses to secondary and higher level storage devices, as well as maintain cache coherency within the system. In one embodiment, each of the processor 110, processor 120 and node controller 130 has an associated onboard cache.

Processors typically have caches incorporated within or associated with them, such that a processor may be viewed as including a cache. In multi-processor systems, it is not uncommon to have caches associated with each processor which maintain data lines in one of four states, those states being exclusive, shared, modified, or invalid. Exclusive state is for data lines in use by that processor and locked or otherwise allowed for use by that processor only within the system. Shared state is for data lines which are in use by the processor but may be used by other processors. Modified state is for data lines in use by the processor which have a data value the processor has modified from its original value. Invalid state is for data lines which have been invalidated within the cache. Invalidation may occur when a processor writes a line to memory or when another processor takes a shared line for exclusive use, thus calling into question the validity of the data in the copy of the line the first processor has.
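As a minimal sketch of the four line states and the two invalidation triggers just described, the following may help. The event names "local_writeback" and "remote_takes_exclusive" are hypothetical labels introduced for this example, and transitions the text does not describe are simply left unchanged rather than modeled.

```python
from enum import Enum


class LineState(Enum):
    EXCLUSIVE = "E"
    SHARED = "S"
    MODIFIED = "M"
    INVALID = "I"


def next_state(state: LineState, event: str) -> LineState:
    """Apply only the invalidation cases named in the description."""
    if event == "local_writeback":
        # The processor wrote the line to memory, so its cached copy is invalidated.
        return LineState.INVALID
    if event == "remote_takes_exclusive" and state is LineState.SHARED:
        # Another processor took a shared line for exclusive use.
        return LineState.INVALID
    # All other transitions are outside the scope of this sketch.
    return state
```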

FIG. 2 illustrates a block diagram of an alternate embodiment of a system having multiple processors. A first processor 210 and a second processor 220 are coupled to a first node controller 230. Also coupled to the first node controller 230 is a first memory 250. A third processor 210 and a fourth processor 220 are coupled to a second node controller 230. Also coupled to the second node controller 230 is a second memory 250. Additionally, coupled separately to the first node controller 230 and the second node controller 230 are a first and second scalability port switch and snoop filter 260. Furthermore, coupled to each of the first and second scalability port switches 260 are a first and second I/O hub 240. In one embodiment, each of the processors 210, processors 220, node controllers 230 and I/O hubs 240 has an associated onboard cache.

FIG. 3 illustrates a block diagram of an embodiment of an I/O (input/output) subsystem. I/O hub 310 is coupled to a PCI bus 320 which in turn is coupled to a PCI device or devices 330. I/O hub 310 is also coupled to an AGP (accelerated graphics port) 340, which in turn is coupled to an AGP device or devices 350. It will be appreciated that numerous implementations of the PCI bus and the AGP exist, any of which may work with various I/O hubs such as I/O hub 310.

FIG. 4 illustrates a block diagram of an embodiment of a scalability port. The scalability port, in one embodiment, includes a first and second node controller 405 and a switch and snoop filter 450. Each node controller 405 includes a memory control block 410, a bus logic block 415, an IRB (incoming request buffer) block 420 and an ORB (outgoing request buffer) block 425, each of which is coupled to the three other components. Furthermore, the node controller 405 includes a port 430 which is coupled to the IRB 420 and the ORB 425. Also, the memory control block 410 may be coupled to a memory for interfacing therewith, and the bus logic block 415 may be coupled to a first and second processor for interfacing therewith, for example. The switch 450 includes a first and second port 455, each of which is coupled to a switch 460, and a snoop pending table and snoop filter block 465.

In one embodiment, incoming requests and outgoing requests are generated and responded to by devices outside the scalability port. Each request is routed through the appropriate node controller 405, such that incoming requests (to the port 430) are placed in the IRB 420 and outgoing requests (to the port 430) are placed in the ORB 425. Additionally, within the switch 450, each port 455 receives incoming and outgoing requests which are routed through the switch 460. These requests may be targeted at another node coupled to the switch 450, or may be targeted at a node coupled to another switch 450, in which case the request may either be routed to the appropriate node or ignored, respectively. Determining whether the target of the request is coupled to the switch 450 is the function of the snoop filter and table 465, which may be expected to maintain information on what data (by address for example) is being utilized by the nodes coupled to the switch 450.
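The role of the snoop filter in deciding whether a request is routed or ignored might be modeled roughly as below. The SnoopFilter class, its method names, and the use of a plain dictionary are assumptions made for illustration; the description only states that the filter tracks, by address for example, which data the nodes coupled to the switch are using.

```python
class SnoopFilter:
    """Tracks which locally coupled nodes are using each line (by address)."""

    def __init__(self):
        self._holders = {}  # line address -> set of node identifiers

    def record(self, line, node):
        """Note that a node coupled to this switch is using the line."""
        self._holders.setdefault(line, set()).add(node)

    def targets(self, line, requester):
        """Nodes on this switch, other than the requester, that use the line."""
        return sorted(self._holders.get(line, set()) - {requester})


def route(snoop_filter, line, requester):
    """Return the nodes to route the request to; an empty list means ignore it."""
    return snoop_filter.targets(line, requester)
```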

The scalability port may be utilized to minimize the problem of read-write conflicts, as described below. Note that the discussion of reads and writes focuses on reading and writing lines, which typically refer to lines of data such as those stored in a cache (either onboard or associated with a processor for example). It will be appreciated that lines of data may refer to various amounts of data, depending on how a system is implemented to transfer data.

FIG. 5 illustrates a flow diagram of an embodiment of a read-write conflict. At block 510, a first line is read by a first node, while at block 520, at about the same time, the first line is written by a second node. At block 530, speculative reads of the first line occur. At block 540, the write of the first line completes, while at block 550, at about the same time, the read of the first line completes. With two disconnected processes for the read and write of the same line, it is not clear whether the read and write resulted in proper data being read or written. For example, in some situations, the data should be read before it is written, whereas in other situations, the data should be written before it is read. However, typically it is important that the read receive the most up-to-date data possible.

FIG. 6A illustrates a flow diagram of an embodiment of a process of handling a read-write conflict. At block 610, a read operation on a first line is commenced by a first node. At block 620, a write operation on the first line is commenced by a second node, at about the same time that the read operation is commenced. At block 630, the conflict between the read and write is detected, such as by comparing the addresses of the read and write requests for example. At block 640, the write is allowed to complete, and the read is delayed so that it will receive or use the data which is written. At block 650, once the write operation has completed, the read operation is allowed to complete, using the data written to the first line by the write operation.
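The write-first ordering of FIG. 6A can be illustrated with a very small model in which memory is a dictionary keyed by line. The function name and the dictionary representation are assumptions for the example; conflict detection (block 630) is taken as already done.

```python
def resolve_write_first(memory, line, write_value):
    """Model of blocks 640 and 650: the write completes, then the delayed read."""
    memory[line] = write_value  # block 640: the write is allowed to complete
    return memory[line]         # block 650: the read now sees the written data
```

For instance, with memory = {7: "old"}, calling resolve_write_first(memory, 7, "new") leaves "new" in memory and hands "new" to the reader.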

FIG. 6B illustrates a flow diagram of an embodiment of a process of handling a read-write conflict. At block 610, a read operation on a first line is commenced by a first node. At block 620, a write operation on the first line is commenced by a second node, at about the same time that the read operation is commenced. At block 670, the conflict between the read and write is detected, such as by comparing the addresses of the read and write requests for example. At block 675, the write operation is stalled. At block 680, the read operation is completed, with the read operation receiving the data to be written from the write operation. At block 685, the written data is invalidated at the second node. At block 690, if appropriate, the read operation completes the write operation without the involvement of the second node, such as through the scalability port. Note that in some instances, the read operation need not complete the write operation, because in some situations the first node will eventually have to write the data it has read, and that will effectively complete the write operation. As will be appreciated, this has the potential to save some of the effort of writing the data back twice, once for the write operation from the second node and once for the write operation the first node will eventually complete with the data it reads.
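A comparable sketch of the read-first ordering of FIG. 6B follows. Here writer_cache stands in for the second node's copy of the data it intends to write, and the complete_write flag models the "if appropriate" condition of block 690; both names are hypothetical.

```python
def resolve_read_first(memory, writer_cache, line, complete_write=True):
    """Model of blocks 675 to 690: the write is stalled and the read is served from it."""
    data = writer_cache[line]  # block 680: the read receives the data to be written
    del writer_cache[line]     # block 685: the data is invalidated at the second node
    if complete_write:
        memory[line] = data    # block 690: write completed without the second node
    return data
```

When complete_write is False, memory keeps its old value until the first node eventually writes back the data it read, which corresponds to the case where the write is not performed twice.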

It will be appreciated that a variety of methods may be used to determine which of the two processes of FIGS. 6A and 6B should be implemented. In most systems, it will be appreciated that there will be times when the process of FIG. 6A should be used and other times when the process of FIG. 6B should be used. However, the exact details for determining which process should be used will necessarily depend on implementation details. In general, it will be appreciated that such a system will determine which of the two processes to use by examining or comparing when and where the two conflicting instructions (read and write) originated and what priorities or properties are assigned to the instructions, such as exclusive or shared use of the subject data for example. Moreover, in some embodiments, an arbitration device independent of the nodes may determine whether to delay the read or write operation.

The embodiment described in the following section is implemented using a specific protocol. It will be appreciated that such a protocol may be implemented in a variety of ways which will be apparent to one skilled in the art. Furthermore, it will be appreciated that variations on such a protocol may be implemented within the spirit and scope of the invention.

Coherent Request Types

In some embodiments, a particular protocol is implemented in the method or by the apparatus in question. In these embodiments, the coherent requests supported on the scalability port are listed in the following table. The table lists all the requests that are used by the coherence protocol, and those requests are then discussed in the following text. In the discussion in this section, a line indicates the length of a coherence unit.

TABLE 1
Coherent Request Types

Request Type            Name              Description                                      Allowed Targets
----------------------  ----------------  -----------------------------------------------  ---------------------------------------
Read                    PRLC              Port Read Line Code                              Home Node, Coherence Controller
                        PRLD              Port Read Line Data
                        PRC               Port Read Current
Read Invalidate         PRIL              Port Read and Invalidate Line                    Home Node, Coherence Controller
                        PRILO             Port Read and Invalidate Line with OWN# set
                        PIL               Port Invalidate Line
                        PFCL              Port Flush Cache Line
                        PILND             Port Invalidate Line No Data
Memory Update           PMWI, PMWE, PMWS  Port Memory Write. I/E/S indicates state of      Home Node, Coherence Controller
                                          line at the requesting node. When data is sent
                                          along with memory updates, it is indicated
                                          with PMW[I/E/S]_D.
Cache Line Replacement  PCLR              Port Cache Line Replacement (Completion not      Coherence Controller
                                          required)
                        PCLRC             Port Cache Line Replacement, Completion
                                          Required
Snoop                   PSLC/PSLD         Port Snoop Line Code/Data                        Coherence Controller, Any Caching Node
                        PSC               Port Snoop Current
Snoop Invalidate        PSIL              Port Snoop Invalidate Line                       Coherence Controller, Any Caching Node
                        PSILO             Port Snoop Invalidate Line with OWN# set
                        PSFCL             Port Snoop Flush Cache Line
                        PSILND            Port Snoop Invalidate Line No Data
Memory Read Request     PMR               Port Memory Read                                 Home Node
                        PMRS              Port Memory Read Speculative
                        PMRSX             Port Memory Read Speculative Cancel

The Port Read Line (PRLC, PRLD and PRC) requests are used to read a cache line. They are used to both read from memory and snoop the cache line in the caching agent(s) at the target node. The Port Read requests are always targeted to the coherence controller or the home node of a memory block. A node that is not the home of the block addressed by the transaction never receives a Port Read request. The code and data read and read current requests are different to facilitate different cache state transitions. The Port Read Current (PRC) request is used to fetch the most current copy of a cache line without changing the ownership of the cache line from the caching agent (typically used by an I/O node).

The Port Read and Invalidate Line (PRIL and PRILO) requests are used to fetch an exclusive copy of a memory block. They are used to both read from memory and snoop invalidate a cache line in the caching agent(s) at the node. The Port Read and Invalidate requests are always targeted to the coherence controller or the home node of a memory block. A node that is not home of the block addressed by the transactions never receives these requests. These two request types differ in their behavior when the memory block is found in the modified state at the snooped node. For a PRIL request, the data is supplied to the requesting node and the home memory is updated, whereas for a PRILO request, the data is supplied only to the source node and the home memory is not updated (the requesting node must cache the line in “M” state for PRILO).

The Port Invalidate Line (PIL) request is a special case of the PRIL request with zero length. This request is used by the requesting node to obtain exclusive ownership of a memory block already cached at the requesting node (for example when writing to a cache line in Shared state). Data can never be returned as a response to a PIL request on the scalability port. Due to concurrent invalidation requests, if the line is found modified at a remote caching node, then this condition must be detected either by the requesting node controller or the coherence controller and the PIL request must be converted to a PRIL request. The PIL request is always targeted to the coherence controller or the home node of the requested memory block. A node that is not home of the block addressed by the transaction never receives this request.

The Port Flush Cache Line (PFCL) request is a special case of the PIL request used to flush a memory block from all the caching agents in the system and update the home memory if the block is modified at a caching agent. The final state of all the nodes, including the requesting node, is Invalid and home memory has the latest data. This request is used to support the IA64 flush cache instruction. This request is always targeted to the coherence controller or the home node of the memory block. A node that is not home of the block addressed by the transaction never receives this request.

The Port Invalidate Line No Data (PILND) request is used by the requesting node to obtain exclusive ownership of a memory block without requesting data. The memory block may or may not be present at the requesting node. The memory block is invalidated in all other nodes in the system. If the line is modified at a remote caching node, then the home memory is updated but data is not returned to the requesting node. This request is intended to be used for efficient handling of full line writes which the requesting node does not intend to keep in its cache (for example I/O DMA writes). This request is always targeted to the coherence controller or the home node of the requested memory block. A node that is not home of the block addressed by the transaction never receives this request.

The Port Memory Write (PMWI_D, PMWE_D, PMWS_D) requests with Data are used to update the content of home memory and the state of the line in the coherence controller. Corresponding Port Memory Write (PMWI, PMWE, PMWS) requests without data are used to update the state of the line in the coherence controller. A PMW[I/E/S] request with or without data does not snoop the caching agent(s) at the node. These requests are very similar in nature except for the state of the line at the originating node. The PMWI request indicates that the memory block is no longer cached at the originating node, the PMWS request indicates that the line is in a shared state at the originating node and the PMWE request indicates that the line is in exclusive state at the originating node. The PMW[I/E/S] requests are always targeted to the coherence controller or the home node of the memory block.

The Port Cache Line Replacement (PCLR, PCLRC) requests are used to indicate to the coherence controller that the node no longer has a copy of the memory block in the caching agents at that node. They are intended to be used only by the originating node of the transaction. These requests are always targeted to the coherence controller to facilitate better tracking of the cache state by the coherence controller. A node can generate a PCLR or PCLRC request only when the state of the cache line at the node changes from S or E to I; generation of these requests when the cache line state at a node is I is not allowed, in order to avoid starvation or livelock on accesses from other nodes. A PCLR or PCLRC request could be dropped or processed by the receiving agent without affecting its final state. The protocol supports two versions of this request to facilitate implementation optimization depending on the type of network implemented. The PCLR request does not expect any response back from the receiving agent and the requesting agent can stop tracking this request in its outbound queue as soon as it is sent on the scalability port. The PCLRC request expects a completion response back from the receiving agent and is tracked in the requesting agent until this response is received. An implementation should use the PCLRC request if it cannot guarantee sequential ordering between requests from the requesting node to the coherence controller over the network, in order to properly handle race conditions between this request and subsequent reads to the same line. If the implementation can guarantee sequential ordering between requests over the network between two nodes, it can use the PCLR request to save network bandwidth (no completion response) and to reduce buffer requirements in the outbound queue at the requesting node.
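The difference in outbound-queue tracking between PCLR and PCLRC, and the rule for choosing between them, could be sketched as follows. The OutboundQueue class and the helper function names are hypothetical; only the request names and the S/E to I restriction come from the description.

```python
class OutboundQueue:
    """Tracks replacement requests that still await a completion response."""

    def __init__(self):
        self.pending = set()  # lines with an outstanding PCLRC

    def send(self, request_type, line):
        if request_type == "PCLRC":
            self.pending.add(line)  # tracked until the completion arrives
        elif request_type != "PCLR":
            raise ValueError("not a cache line replacement request")
        # A PCLR is not tracked once it has been sent.

    def on_completion(self, line):
        self.pending.discard(line)


def choose_replacement_request(network_is_ordered):
    """Use PCLR only when the network guarantees sequential ordering."""
    return "PCLR" if network_is_ordered else "PCLRC"


def may_send_replacement(old_state, new_state):
    """Replacement requests are allowed only on an S or E to I transition."""
    return old_state in ("S", "E") and new_state == "I"
```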

The Port Snoop (PSLC, PSLD and PSC) requests are used to initiate a snoop request at a caching node. The snoops caused by the code or data snoop request and the read current request are different to facilitate different cache state transitions. The Port Snoop requests could be targeted to any caching node. These requests do not have any effect on the home memory blocks; they only affect the state of a memory block in the caching agents at the target node.

The Port Snoop and Invalidate (PSIL, PSILO and PSILND) requests are used to snoop and invalidate a memory block at a caching node. These requests could be targeted to any caching node. These three request types differ in their behavior when the memory block is found in the modified state at the snooped node. For a PSIL request, the data is supplied to the source node and the home memory is updated. For a PSILO request, the data is supplied only to the source node; the home memory is not updated. For a PSILND request, only the home memory is updated; the data is not supplied to the requesting node.
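The three behaviors on a modified line can be captured in a small lookup table, as in the sketch below. The cache and home memory are modeled as dictionaries, the tuple layout (state, data) is an assumption for the example, and handling of lines that are not in the modified state is deliberately left out.

```python
# Behavior of each request when the snooped line is found Modified.
ON_MODIFIED = {
    "PSIL":   {"to_requester": True,  "update_home": True},
    "PSILO":  {"to_requester": True,  "update_home": False},
    "PSILND": {"to_requester": False, "update_home": True},
}


def snoop_invalidate(request, cache, home_memory, line):
    """Invalidate the line at the snooped node; return data sent to the requester."""
    state, data = cache.get(line, ("I", None))
    cache[line] = ("I", None)      # the snooped copy is invalidated in every case
    if state != "M":
        return None                # non-modified cases are not modeled here
    rule = ON_MODIFIED[request]
    if rule["update_home"]:
        home_memory[line] = data   # write the modified data back to home memory
    return data if rule["to_requester"] else None
```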

The Port Snoop Flush Cache Line (PSFCL) request is used to flush a memory block from all the caching agents and update the home memory if the block is modified at a caching agent. This request is used to support the IA64 flush cache instruction and to facilitate backward invalidates due to snoop filter evictions at the coherence controller. The PSFCL request could be targeted to any caching node.

The Port Memory Read (PMR) and Port Memory Read Speculative (PMRS) requests are used to read a home memory block. These requests are used to read memory and do not cause a snoop of caching agent(s) at the home node. They are always targeted to the home node of a memory block. The PMRS request is a speculative request whereas PMR is a non-speculative request. The Port Memory Read Speculative Cancel (PMRSX) request is used to cancel a speculative read request (PMRS) to a cache line. A PMRS request depends on a non-speculative request for the same cache line for confirmation. It is confirmed by a PMR, PRLC, PRLD, PRC, PRIL, or PRILO request for the same cache line. The confirmation request may or may not be due to the same transaction that caused the PMRS request. The PMRS request is cancelled by a PMW[I/E/S] or a PMRSX request for the same cache line. The cancellation request may or may not be due to the same transaction that caused the PMRS request. The PMRS request can be dropped by the responding agent without any functional issue.
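The confirm and cancel rules for a speculative read can be modeled with a small tracker at the home node, sketched below. The SpeculativeReads class and its return strings are hypothetical; the sets of confirming and cancelling requests are taken from the paragraph above, and only the PMW variants without data are listed since that is how the text names them.

```python
CONFIRMING = {"PMR", "PRLC", "PRLD", "PRC", "PRIL", "PRILO"}
CANCELLING = {"PMWI", "PMWE", "PMWS", "PMRSX"}


class SpeculativeReads:
    """Tracks outstanding PMRS requests per cache line."""

    def __init__(self):
        self.pending = set()

    def on_request(self, request, line):
        if request == "PMRS":
            self.pending.add(line)
            return "pending"
        if line in self.pending:
            if request in CONFIRMING:
                self.pending.discard(line)
                return "confirmed"
            if request in CANCELLING:
                self.pending.discard(line)
                return "cancelled"
        return "none"  # no speculative read was affected
```

Since the text allows a PMRS to be dropped without functional issue, an implementation of this kind could also discard entries from pending at any time; the sketch does not model that.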

Response Types for Coherent Requests

Response types for coherent request transactions on the scalability port are listed in Table 2. These responses are used under normal circumstances or could be combined with special circumstances with proper response status to indicate failed, unsupported or aborted requests.

TABLE 2
Responses for Coherent Requests

Responses               Description
----------------------  ------------------------------------------------------------
PSNR[I/S/M/MS]          Port Snoop Result. I/S/M/MS indicates state of the line in
                        remote nodes (I = Invalid, S = Shared, M = Modified
                        transitioned to Invalid, MS = Modified transitioned to
                        Shared)
PCMP                    Port Completion Response
PRETRY                  Port Retry Response
PDATA                   Port Normal Data Response
PSNR[I/S/M/MS]_CMP      Combined response for PSNR[I/S/M/MS] + PCMP
PSNR[I/S/M/MS]_D        Combined response for PSNR[I/S/M/MS] + PDATA
PCMP_D                  Combined response for PCMP + PDATA
PSNR[I/S/M/MS]_CMP_D    Combined response for PSNR[I/S/M/MS] + PCMP + PDATA

The Port Snoop Result (PSNR) response is used to convey the result of a snoop back to the requesting node. The PSNR response indicates whether the line was found in Modified state and the final state of the line at the snooped agent. The state of the line could be Invalid (except for PRC or PSC) at the snooped caching agent(s) (PSNRI), Shared (except for PRC or PSC) at the snooped caching agent(s) (PSNRS), Modified transitioning to Invalid (except for PRC or PSC) at the snooped caching agent (PSNRM) or Modified transitioning to Shared at the snooped caching agent (PSNRMS). For a PRC or PSC transaction, if the cache line state at the node is E, S, or I then either a PSNRI or PSNRS response is allowed; if the cache line state is M then either a PSNRM or PSNRMS response is allowed.

The Port Completion (PCMP) response is used in determining the completion of a transaction under certain protocol conditions. This response can be generated only by the home node of the memory block or by the coherence controller for some transactions such as PRC, PSC, PRILO and PSILO.

The Port Retry (PRETRY) response is the protocol level retry response. The corresponding request is retried from the requesting node. This response is used to resolve conflict cases associated with multiple transactions to the same memory block. When the requesting agent receives the PRETRY response to a PMWx request, it retries the PMWx request if no conflict has been detected. If the requesting agent has already seen the conflict before it receives the PRETRY response, the PMWx request is converted into a response to the incoming request.
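The two outcomes of a PRETRY on a PMWx request can be expressed as a small piece of requester-side state. The WritebackRequester class, the on_incoming_request hook, and the returned action strings are hypothetical names for this sketch; how an agent actually observes the conflicting incoming request is left to the implementation.

```python
class WritebackRequester:
    """Requesting agent with one outstanding PMWx request."""

    def __init__(self):
        self.conflict_seen = False

    def on_incoming_request(self):
        # A conflicting request for the same memory block reached this agent.
        self.conflict_seen = True

    def on_pretry(self):
        if self.conflict_seen:
            # The PMWx is converted into a response to the incoming request.
            return "convert_to_response"
        # No conflict has been detected, so the PMWx is simply retried.
        return "retry_PMWx"
```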

The Port Normal Data (PDATA) response is used to return the data requested by the corresponding read request. It does not have any other protocol level state information apart from the source node identifier and the transaction identifier of the request to associate it with the proper request.

The protocol also supports certain combined responses which could be used by the responding node to optimize use of bandwidth on the scalability port (SP). The PSNR[I/S/M/MS]_CMP response is the same as PSNR[I/S/M/MS]+PCMP, the PSNR[I/S/M/MS]_D response is the same as PSNR[I/S/M/MS]+PDATA, the PCMP_D response is the same as PCMP+PDATA and the PSNR[I/S/M/MS]_CMP_D response is the same as PSNR[I/S/M/MS]+PCMP+PDATA.
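The naming convention for combined responses lends itself to a simple expansion routine, sketched here. The function name expand_response is an assumption; the suffix meanings (_CMP for PCMP, _D for PDATA) follow Table 2.

```python
def expand_response(name):
    """Split a combined response name into the individual responses it stands for."""
    parts = name.split("_")
    expanded = [parts[0]]          # base response, for example PSNRM or PCMP
    for suffix in parts[1:]:
        if suffix == "CMP":
            expanded.append("PCMP")
        elif suffix == "D":
            expanded.append("PDATA")
        else:
            raise ValueError("unknown response suffix: " + suffix)
    return expanded
```

For example, expand_response("PSNRM_CMP_D") yields the three separate responses PSNRM, PCMP and PDATA, which a receiver could then process exactly as if they had arrived individually.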

FIG. 7 illustrates a flow diagram or timing diagram 700 of an embodimentof a process including a read-write conflict. Initially, Node Ainitiates a PRL (read operation) and at about the same time, Node Binitiates a PMWI (write operation). The write operation in question willinvalidate the line in the Node B cache. However, due to thesimultaneous nature of the operations, the read operation does a PMRS(speculative read) to the home node and a PSL (snoop line) to checkstatus of the line in the Node B cache. The resulting PSNRI indicatesthat the line in the Node B cache is invalid (due to the PMWI) and thePMR reads the unwritten line from the Home Node. The PDATA completes theread by sending the data back to Node A. Then, the PMWI from the SPS tothe Home Node writes the data and the PCMP signals that the writecompleted successfully, ignoring the incorrect data sent to Node A.

FIG. 8A illustrates a flow diagram or timing diagram 800 of anembodiment of a process suitable for resolving a read-write conflict.The PRL from Node A initiates the read operation and the PMWI from NodeB initiates the write operation at about the same time, with the SPS(scalability port) receiving the PRL first. This time, the PMRS and PSLare set up to trigger the SPS to detect any conflict, and the PMWI issent back to Node B with a PRETRY, causing the PSL to meet the returningPRETRY and read the data which is about to be written to the Home Nodeof the line. The read operation completes with the PSNRM_D completion,and the write is completed as a result of the read operation with thePMWI operation, resulting in completion with the PCMP completions.

FIG. 8B illustrates a flow diagram or timing diagram 850 of an alternate embodiment of a process suitable for resolving a read-write conflict. A PMWI (write operation) and PRL (read operation) are initiated, with the SPS receiving the PMWI first. The PMWI proceeds to the Home Node, and the write completes with the PCMP operations. In the meantime, the read operation is retried with the PRETRY and subsequent PRL operations, leading to the PSNRI (snoop invalidate), PMRS (speculative read) and PMR (read) operations, of which the PMR operation results in a PDATA completion with the newly written data. As will be appreciated, the determining factor in whether to complete the read or write first is which operation occurs at the SPS first.
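The arrival-order rule common to FIGS. 8A and 8B may be sketched as follows. This Python example is a minimal illustration, assuming a single table of pending line addresses at the SPS. The names Request, ConflictArbiter, submit and complete are hypothetical, and the "ACCEPT" result is merely a label for a request that is allowed to proceed. The sketch does not model the FIG. 8A detail in which the snoop picks up the data of the retried write.

```python
from dataclasses import dataclass


@dataclass
class Request:
    node: str  # originating node, e.g. "A" or "B"
    op: str    # "PRL" (read) or "PMWI" (write)
    addr: int  # address of the line the request relates to


class ConflictArbiter:
    """First request to arrive for a line proceeds; later ones are retried."""

    def __init__(self):
        self.pending = {}  # line address -> request currently in progress

    def submit(self, req):
        # A second request to a line that is already pending is a conflict
        # and is answered with PRETRY, whichever of read or write it is.
        if req.addr in self.pending:
            return "PRETRY"
        self.pending[req.addr] = req
        return "ACCEPT"

    def complete(self, addr):
        # Called when the in-progress request for this line has finished,
        # so that a retried request can be accepted on resubmission.
        return self.pending.pop(addr)
```

If a PRL arrives first, a subsequent PMWI to the same address receives PRETRY (as in FIG. 8A); if the PMWI arrives first, the PRL receives PRETRY and is reissued after the write completes (as in FIG. 8B). Requests to different addresses do not interact.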

Alternative Scalability Port Implementations

The following section addresses some of the alternative scalability port implementations which may be utilized within the spirit and scope of the invention. It will be appreciated that these are exemplary in nature rather than limiting. Other alternative embodiments will be apparent to those skilled in the art.

FIG. 9 illustrates a block diagram of an embodiment of a processor having portions of a scalability port integrated therein. Such an embodiment need not implement the protocol addressed in the previous section. In one embodiment, processor 900 includes scalability port node controller 910 and scalability port switch 920. Scalability port node controller 910 is suitable for coupling to a memory such as memory 930. Scalability port switch 920 is suitable for coupling to an I/O hub or interface such as I/O hub 940. Scalability port node controller 910 and scalability port switch 920 may collectively include an incoming request buffer, outgoing request buffer, memory control logic, snoop pending table and snoop filter. In one embodiment, scalability port node controller 910 includes an incoming request buffer, outgoing request buffer and memory control logic suitable for interfacing with memory 930. In such an embodiment, scalability port switch 920 may also include a snoop pending table, snoop filter and I/O interface logic suitable for interfacing with I/O hub 940. In such an embodiment, scalability port switch 920 may couple to the incoming request buffer and outgoing request buffer of scalability port node controller 910, and include I/O interface logic suitable for coupling to the I/O hub 940.

FIG. 10 illustrates a block diagram of an alternate embodiment of a processor having portions of a scalability port integrated therein. In one embodiment, each instance of processor 1000 includes a scalability port node controller 1010 and scalability port switch 1020. Scalability port switch 1020 is part of scalability port node controller 1010, and collectively the two components (1010, 1020) include an incoming request buffer, outgoing request buffer, and control logic. Scalability port switch 1020 includes a snoop pending table, snoop filter, and I/O interface logic suitable for coupling to an I/O hub or other I/O device, such as I/O hub 1040. Scalability port node controller 1010 includes memory control logic suitable for interfacing with memory 1030. Note that memory 1030 may be separate for each processor 1000 or shared between two (or more) processors 1000.

In the foregoing detailed description, the method and apparatus of the present invention have been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the present invention. In particular, the separate blocks of the various block diagrams represent functional blocks of methods or apparatuses and are not necessarily indicative of physical or logical separations or of an order of operation inherent in the spirit and scope of the present invention. For example, the various blocks of FIGS. 1 or 2 (among others) may be integrated into components, or may be subdivided into components. Similarly, the blocks of FIGS. 6A or 7 (among others) represent portions of a method which, in some embodiments, may be reordered or may be organized in parallel rather than in a linear or step-wise fashion. The present specification and figures are accordingly to be regarded as illustrative rather than restrictive.

1. An apparatus comprising: an incoming request buffer to store requests relating to read and write operations, the requests including addresses to be read or written, an assigned priority, and a property comprising that the operation involves data that is for exclusive use, shared use, modified use, or is invalidated; an outgoing request buffer to store requests relating to read and write operations coupled to the incoming request buffer; bus logic configured to interface with a bus, the bus logic coupled to the incoming request buffer and the outgoing request buffer; a snoop pending table to contain entries related to cache lines coupled to the incoming request buffer and the outgoing request buffer; a snoop filter coupled to the snoop pending table; control logic to interface with and coupled to the incoming request buffer, the outgoing request buffer, and the bus logic, the control logic to compare addresses of requests of the incoming request buffer and outgoing request buffer and detect identical addresses among requests of the incoming request buffer and the outgoing request buffer, the control logic to stall a second request of the incoming request buffer and outgoing request buffer pending completion of a first request of the incoming request buffer and outgoing request buffer when the second request and the first request include identical addresses; and an arbitration device to determine which request should proceed first based on the property of the requests.
 2. The apparatus of claim 1 wherein: the outgoing request buffer to receive read requests and write requests from a bus through the bus logic.
 3. The apparatus of claim 2 wherein: the control logic to pass requests to the outgoing request buffer and incoming request buffer to read data from or write data to a cache associated with a processor.
 4. The apparatus of claim 2 further comprising: a memory controller to interface with and control a memory, the memory controller coupled to the incoming request buffer, the outgoing request buffer, the bus logic, and the control logic; and wherein: the control logic to pass requests to the memory controller to read data from or write data to the memory.
 5. The apparatus of claim 1 wherein: the arbitration device determines requests relating to a read operation should proceed first based on the property of the requests.
 6. The apparatus of claim 1 wherein: the arbitration device determines requests relating to a write operation should proceed first based on the property of the requests.
 7. A system comprising: a first processor; a second processor; a scalability port coupled through a bus to the first processor and coupled through the bus to the second processor, the scalability port including: an incoming request buffer to store requests relating to read and write operations, the requests including addresses to be read or written, an assigned priority, and a property comprising that the operation involves data that is for exclusive use, shared use, modified use, or is invalidated; an outgoing request buffer to store requests relating to read and write operations, the requests including addresses to be read or written, coupled to the incoming request buffer; bus logic to interface with the bus, the bus logic coupled to the incoming request buffer and the outgoing request buffer; a snoop pending table to contain entries related to cache lines coupled to the incoming request buffer and the outgoing request buffer; a snoop filter coupled to the snoop pending table; control logic to interface with and coupled to the incoming request buffer, the outgoing request buffer, and the bus logic, the control logic to compare addresses of requests of the incoming request buffer and outgoing request buffer and detect identical addresses among requests of the incoming request buffer and the outgoing request buffer, the control logic to stall a second request of the incoming request buffer and the outgoing request buffer pending completion of a first request of the incoming request buffer and the outgoing request buffer when the second request and the first request include identical addresses; and an arbitration device to determine which request should proceed first depending on the property of the requests.
 8. The system of claim 7 further comprising: a memory coupled to the scalability port; and wherein the scalability port further includes: a memory controller to interface with and control the memory, the memory controller coupled to the incoming request buffer, the outgoing request buffer, the bus logic, and the control logic; and wherein: the control logic to pass requests to the memory controller to read from or write data to the memory.
 9. The system of claim 7 wherein: the outgoing request buffer and incoming request buffer to receive read requests and write requests from the bus through the bus logic, the read requests and write requests each individually originating from one of the first processor or the second processor.
 10. The system of claim 7 wherein: the control logic to pass requests to the outgoing request buffer and to the incoming request buffer to write data to or read data from a cache associated with the first processor.
 11. The system of claim 10 wherein: the control logic further to pass requests to the outgoing request buffer and to the incoming request buffer to write data to or read data from a cache associated with the second processor.
 12. The system of claim 7 wherein: the arbitration device determines requests relating to a read operation should proceed first based on the property of the requests.
 13. The system of claim 7 wherein: the arbitration device determines requests relating to a write operation should proceed first based on the property of the requests. 